##Introduction

As a company that specializes in talent management, we have been tasked with identifying the top three factors that lead to employee attrition (turnover). Additionally, we have been tasked with creating a model that predicts Attrition as well as a model that predicts Monthly Income for the corporation's employees.

Youtube: https://youtu.be/k0ob1JmyLf4

#Reading and tidying datasets

Reading in training data set

Reading in test data set for attrition

##Reading Test data set for Monthly Income

##Removing unnecessary columns from training set and setting all categorical variables to be factors

##Training data EDA

Attrition By Department

## # A tibble: 6 x 2
##   Attrition     n
##   <fct>     <int>
## 1 No           29
## 2 Yes           6
## 3 No          487
## 4 Yes          75
## 5 No          214
## 6 Yes          59
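
The counts above come in No/Yes pairs, one pair per department, but the Department column itself is missing from the printed tibble. As a minimal sketch, the counts can be turned into attrition rates per group as follows. The labels `group 1` to `group 3` are placeholders, since the output does not show which department each pair belongs to.

```r
# Counts copied from the tibble above; each No/Yes pair is one department.
# The group labels are placeholders because the printed output omits Department.
counts <- data.frame(
  group     = rep(c("group 1", "group 2", "group 3"), each = 2),
  Attrition = rep(c("No", "Yes"), times = 3),
  n         = c(29, 6, 487, 75, 214, 59)
)

# Total head count per group, then the share of employees who left.
totals <- tapply(counts$n, counts$group, sum)
left   <- with(counts[counts$Attrition == "Yes", ], setNames(n, group))
attrition_rate <- round(left / totals[names(left)], 3)

print(attrition_rate)
```

By this arithmetic the three groups have attrition rates of roughly 17%, 13%, and 22% (6/35, 75/562, and 59/273).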

Attrition by Age

There seems to be a quadratic trend: attrition is high in the late teens and early 20s, levels off in the 30s, and starts picking back up in the 50s.

## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).

##Attrition by JobSatisfaction

There seems to be a very strong correlation between JobSatisfaction and attrition rate: the greater the job satisfaction, the lower the likelihood of attrition.


##Attrition by total working years

Similar to age, the likelihood of attrition seems to decrease as total working years increase.

## Warning: Removed 10 rows containing missing values (geom_point).

##Attrition by Job Role

Sales representatives appear to have a much higher attrition rate.

##Attrition by PercentSalaryHike

There’s a very small correlation between percent salary hike and attrition.

##Attrition by hourly rate

There doesn’t appear to be any real correlation between hourly rate and attrition.

## Warning: Removed 12 rows containing missing values (geom_point).

##Attrition by OverTime

Working overtime appears to have a significant impact on attrition rate.

##Attrition by Monthly Income

## Warning: Removed 813 rows containing missing values (geom_point).

##Choosing to test Bayes models with the factors that had the most impact on attrition: Age, JobSatisfaction, TotalWorkingYears, JobRole, and HourlyRate

The models are about 87% accurate (0.866 to 0.874 across the first five fits below) but low on specificity.
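
As a check on how these statistics are derived, the first matrix printed below can be recomputed by hand. The printed sensitivity and specificity treat the columns as the reference classes, with "No" as the positive class. The row totals (227 and 34) are identical in the first four matrices while the column totals change, which suggests the rows may actually hold the true labels. That is an inference from the printed numbers, not something the report states, so the sketch below computes both readings for comparison.

```r
# First confusion matrix printed below, entered exactly as laid out.
# matrix() fills column by column: column "No" = (225, 32), column "Yes" = (2, 2).
cm <- matrix(c(225, 32, 2, 2), nrow = 2,
             dimnames = list(row = c("No", "Yes"), col = c("No", "Yes")))

accuracy <- sum(diag(cm)) / sum(cm)                   # (225 + 2) / 261

# Reading used in the printed statistics: columns are the reference classes.
sens_printed <- cm["No", "No"]   / sum(cm[, "No"])    # 225 / 257
spec_printed <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # 2 / 4

# Alternative reading (an assumption): rows are the true classes.
sens_rows <- cm["No", "No"]   / sum(cm["No", ])       # 225 / 227
spec_rows <- cm["Yes", "Yes"] / sum(cm["Yes", ])      # 2 / 34

round(c(accuracy = accuracy,
        sens_printed = sens_printed, spec_printed = spec_printed,
        sens_rows = sens_rows, spec_rows = spec_rows), 4)
```

Under the row-based reading, sensitivity and specificity equal the values printed as Pos Pred Value and Neg Pred Value (0.9912 and 0.0588), so it is worth confirming which argument held the predictions when the matrix was built.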

## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  225   2
##   Yes  32   2
##                                           
##                Accuracy : 0.8697          
##                  95% CI : (0.8227, 0.9081)
##     No Information Rate : 0.9847          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.08            
##                                           
##  Mcnemar's Test P-Value : 6.577e-07       
##                                           
##             Sensitivity : 0.87549         
##             Specificity : 0.50000         
##          Pos Pred Value : 0.99119         
##          Neg Pred Value : 0.05882         
##              Prevalence : 0.98467         
##          Detection Rate : 0.86207         
##    Detection Prevalence : 0.86973         
##       Balanced Accuracy : 0.68774         
##                                           
##        'Positive' Class : No              
## 
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  226   1
##   Yes  34   0
##                                           
##                Accuracy : 0.8659          
##                  95% CI : (0.8185, 0.9048)
##     No Information Rate : 0.9962          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0075         
##                                           
##  Mcnemar's Test P-Value : 6.338e-08       
##                                           
##             Sensitivity : 0.8692          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9956          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.9962          
##          Detection Rate : 0.8659          
##    Detection Prevalence : 0.8697          
##       Balanced Accuracy : 0.4346          
##                                           
##        'Positive' Class : No              
## 
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  223   4
##   Yes  29   5
##                                          
##                Accuracy : 0.8736         
##                  95% CI : (0.827, 0.9113)
##     No Information Rate : 0.9655         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1883         
##                                          
##  Mcnemar's Test P-Value : 2.943e-05      
##                                          
##             Sensitivity : 0.8849         
##             Specificity : 0.5556         
##          Pos Pred Value : 0.9824         
##          Neg Pred Value : 0.1471         
##              Prevalence : 0.9655         
##          Detection Rate : 0.8544         
##    Detection Prevalence : 0.8697         
##       Balanced Accuracy : 0.7202         
##                                          
##        'Positive' Class : No             
## 
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  224   3
##   Yes  30   4
##                                          
##                Accuracy : 0.8736         
##                  95% CI : (0.827, 0.9113)
##     No Information Rate : 0.9732         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1577         
##                                          
##  Mcnemar's Test P-Value : 6.011e-06      
##                                          
##             Sensitivity : 0.8819         
##             Specificity : 0.5714         
##          Pos Pred Value : 0.9868         
##          Neg Pred Value : 0.1176         
##              Prevalence : 0.9732         
##          Detection Rate : 0.8582         
##    Detection Prevalence : 0.8697         
##       Balanced Accuracy : 0.7267         
##                                          
##        'Positive' Class : No             
## 
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  222   2
##   Yes  32   5
##                                           
##                Accuracy : 0.8697          
##                  95% CI : (0.8227, 0.9081)
##     No Information Rate : 0.9732          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1908          
##                                           
##  Mcnemar's Test P-Value : 6.577e-07       
##                                           
##             Sensitivity : 0.8740          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.9911          
##          Neg Pred Value : 0.1351          
##              Prevalence : 0.9732          
##          Detection Rate : 0.8506          
##    Detection Prevalence : 0.8582          
##       Balanced Accuracy : 0.7942          
##                                           
##        'Positive' Class : No              
## 
## [1] 0.8406897
## [1] 0.002048626
## [1] 0.8509524
## [1] 0.002089431
## [1] 0.57255
## [1] 0.002089431
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  214   4
##   Yes  38   5
##                                           
##                Accuracy : 0.8391          
##                  95% CI : (0.7888, 0.8815)
##     No Information Rate : 0.9655          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1435          
##                                           
##  Mcnemar's Test P-Value : 3.543e-07       
##                                           
##             Sensitivity : 0.8492          
##             Specificity : 0.5556          
##          Pos Pred Value : 0.9817          
##          Neg Pred Value : 0.1163          
##              Prevalence : 0.9655          
##          Detection Rate : 0.8199          
##    Detection Prevalence : 0.8352          
##       Balanced Accuracy : 0.7024          
##                                           
##        'Positive' Class : No              
## 
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  211   7
##   Yes  37   6
##                                           
##                Accuracy : 0.8314          
##                  95% CI : (0.7804, 0.8748)
##     No Information Rate : 0.9502          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1492          
##                                           
##  Mcnemar's Test P-Value : 1.232e-05       
##                                           
##             Sensitivity : 0.8508          
##             Specificity : 0.4615          
##          Pos Pred Value : 0.9679          
##          Neg Pred Value : 0.1395          
##              Prevalence : 0.9502          
##          Detection Rate : 0.8084          
##    Detection Prevalence : 0.8352          
##       Balanced Accuracy : 0.6562          
##                                           
##        'Positive' Class : No              
## 
## [1] 0.842069
## [1] 0.002061901
## [1] 0.8508076
## [1] 0.002113658
## [1] 0.6040192
## [1] 0.002113658
## [1] 0.8407663
## [1] 0.00207462
## [1] 0.8507683
## [1] 0.002080485
## [1] 0.5916438
## [1] 0.002080485
## [1] 0.849387
## [1] 0.001940469
## [1] 0.8587652
## [1] 0.002087962
## [1] 0.6427007
## [1] 0.002087962

##Best Bayes model

The best Bayes model included Age, JobRole, JobSatisfaction, and OverTime, with an accuracy of about 85%, a sensitivity of about .86, and a specificity of about .64.

## [1] 0.849387
## [1] 0.001940469
## [1] 0.8587652
## [1] 0.002087962
## [1] 0.6427007
## [1] 0.002087962
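
The report does not show how the unlabeled numbers above were produced. One common way to get stable estimates of accuracy, sensitivity, and specificity is to average them over many random train/test splits. The sketch below illustrates that idea under several assumptions: the data are synthetic, only two categorical predictors are used, and the naive Bayes classifier is a small hand-written version (`fit_nb` and `predict_nb` are hypothetical helpers) so that the example has no package dependencies.

```r
set.seed(1)

# Synthetic stand-in data; column names mirror the report, values are made up.
n <- 400
toy <- data.frame(
  OverTime        = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  JobSatisfaction = factor(sample(1:4, n, replace = TRUE))
)
p_leave <- ifelse(toy$OverTime == "Yes", 0.6, 0.1)
toy$Attrition <- factor(ifelse(runif(n) < p_leave, "Yes", "No"),
                        levels = c("No", "Yes"))

# Categorical naive Bayes with add-one smoothing (hypothetical helper).
fit_nb <- function(x, y) {
  list(prior = sapply(levels(y), function(k) mean(y == k)),
       cond  = lapply(x, function(col) {
         tab <- table(y, col) + 1
         tab / rowSums(tab)          # P(feature value | class), one row per class
       }))
}

# Pick the class with the largest log posterior for each row of newx.
predict_nb <- function(model, newx) {
  classes <- names(model$prior)
  log_post <- sapply(classes, function(k) {
    lp <- rep(log(model$prior[[k]]), nrow(newx))
    for (v in names(model$cond)) {
      lp <- lp + log(model$cond[[v]][k, as.character(newx[[v]])])
    }
    lp
  })
  factor(classes[max.col(log_post, ties.method = "first")], levels = classes)
}

# One random 70/30 split; "No" is treated as the positive class.
one_split <- function(data, train_frac = 0.7) {
  idx   <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  preds <- c("OverTime", "JobSatisfaction")
  model <- fit_nb(data[idx, preds], data$Attrition[idx])
  pred  <- predict_nb(model, data[-idx, preds])
  truth <- data$Attrition[-idx]
  c(accuracy    = mean(pred == truth),
    sensitivity = mean(pred[truth == "No"] == "No"),
    specificity = mean(pred[truth == "Yes"] == "Yes"))
}

results <- t(replicate(100, one_split(toy)))
colMeans(results)        # average of each metric over the 100 splits
apply(results, 2, var)   # spread of each metric across splits
```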

##Comparing against Knn Model

## integer(0)
## [1] NA
## [1] 0.8330268
## [1] 0.00218046
## [1] 0.8517895
## [1] 0.002144115
## [1] 0.4511064
## [1] 0.002144115
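
For reference, a KNN comparison of this kind can be sketched with `knn()` from the `class` package. The report does not show which predictors or which `k` were used, so the numeric predictors, `k = 5`, the 70/30 split, and the synthetic data below are all assumptions. KNN relies on distances, so the predictors are scaled using the training set's means and standard deviations.

```r
library(class)  # recommended package that ships with standard R installations

set.seed(2)

# Synthetic stand-in data; the rule "younger employees leave more often" is made up.
n <- 300
toy <- data.frame(Age               = round(runif(n, 18, 60)),
                  TotalWorkingYears = round(runif(n, 0, 35)))
p_leave <- ifelse(toy$Age < 30, 0.6, 0.1)
toy$Attrition <- factor(ifelse(runif(n) < p_leave, "Yes", "No"),
                        levels = c("No", "Yes"))

idx  <- sample(n, size = 210)                 # 70% of rows for training
cols <- c("Age", "TotalWorkingYears")

# Scale the training set, then apply the same centering and scaling to the test set.
train_x <- scale(toy[idx, cols])
test_x  <- scale(toy[-idx, cols],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

pred <- knn(train = train_x, test = test_x, cl = toy$Attrition[idx], k = 5)

# Rows are predictions, columns are the true labels.
conf <- table(Prediction = pred, Reference = toy$Attrition[-idx])
accuracy <- sum(diag(conf)) / sum(conf)
print(conf)
print(accuracy)
```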

##Classifying Attrition for Test Data

##EDA for imputing Monthly Income

JobLevel has the highest correlation with monthly income (.952), followed by total working years (.779), years at company (.491), age (.485), and years since last promotion (.316).
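
Correlations like these can be computed in one pass with `cor()`. The sketch below uses synthetic data with made-up coefficients, so its numbers will not match the report. JobLevel is kept numeric here, which a correlation requires, even though the regression models treat it as a factor.

```r
set.seed(3)

# Synthetic stand-in data; column names mirror the report, values are made up.
n <- 200
toy <- data.frame(JobLevel = sample(1:5, n, replace = TRUE))
toy$TotalWorkingYears <- pmax(0, round(6 * toy$JobLevel + rnorm(n, sd = 5)))
toy$MonthlyIncome <- 3500 * toy$JobLevel + 30 * toy$TotalWorkingYears +
  rnorm(n, sd = 1200)

# Pearson correlation of each numeric predictor with MonthlyIncome, largest first.
predictors <- setdiff(names(toy), "MonthlyIncome")
cors <- sapply(toy[predictors], function(x) cor(x, toy$MonthlyIncome))
sort(cors, decreasing = TRUE)
```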

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

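##First Model for computing Monthly Incomes

The first model regresses MonthlyIncome on JobLevel alone. The RMSE printed below is 1216.151 (a figure of 1410.878 was also recorded for this model, presumably from a different run or split).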

## 
## Call:
## lm(formula = MonthlyIncome ~ JobLevel, data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4642.2  -668.0  -107.3   668.3  4412.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2743.82      69.69   39.37   <2e-16 ***
## JobLevel2    2800.46      99.89   28.04   <2e-16 ***
## JobLevel3    7108.38     130.24   54.58   <2e-16 ***
## JobLevel4   12509.83     177.45   70.50   <2e-16 ***
## JobLevel5   16480.15     219.18   75.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1264 on 865 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9244 
## F-statistic:  2658 on 4 and 865 DF,  p-value: < 2.2e-16
##                 2.5 %    97.5 %
## (Intercept)  2607.044  2880.604
## JobLevel2    2604.402  2996.509
## JobLevel3    6852.766  7363.996
## JobLevel4   12161.551 12858.101
## JobLevel5   16049.957 16910.342
## [1] 1216.151
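
The report does not show how this RMSE was computed. A minimal sketch of one common approach is to fit on a training split and score on held-out rows. The synthetic data, the 70/30 split, and the coefficients used to simulate income are all assumptions.

```r
set.seed(4)

# Synthetic stand-in data: income rises with job level, plus noise.
n <- 500
toy <- data.frame(JobLevel = factor(sample(1:5, n, replace = TRUE)))
toy$MonthlyIncome <- 2700 + 3500 * (as.integer(toy$JobLevel) - 1) +
  rnorm(n, sd = 1200)

# Fit on 70% of the rows and predict the remaining 30%.
idx  <- sample(n, size = 350)
fit  <- lm(MonthlyIncome ~ JobLevel, data = toy[idx, ])
pred <- predict(fit, newdata = toy[-idx, ])

# Root mean squared error on the held-out rows.
rmse <- sqrt(mean((toy$MonthlyIncome[-idx] - pred)^2))
print(rmse)
```

Because the error is measured on rows the model never saw, it can differ from the residual standard error reported by `summary()`.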

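##2nd Model adding TotalWorkingYears

Adding TotalWorkingYears improves the error slightly: the RMSE printed below is 1203.668 (a figure of 1365 was also recorded for this model, presumably from a different run or split).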

## 
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + TotalWorkingYears, data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4957.9  -657.8  -134.6   618.2  4525.8 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2544.901     89.085  28.567  < 2e-16 ***
## JobLevel2          2652.205    107.666  24.634  < 2e-16 ***
## JobLevel3          6820.371    152.732  44.656  < 2e-16 ***
## JobLevel4         11858.212    254.564  46.582  < 2e-16 ***
## JobLevel5         15800.546    289.997  54.485  < 2e-16 ***
## TotalWorkingYears    33.442      9.426   3.548 0.000409 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1256 on 864 degrees of freedom
## Multiple R-squared:  0.9258, Adjusted R-squared:  0.9254 
## F-statistic:  2157 on 5 and 864 DF,  p-value: < 2.2e-16
##                         2.5 %      97.5 %
## (Intercept)        2370.05330  2719.74815
## JobLevel2          2440.88699  2863.52254
## JobLevel3          6520.60145  7120.13957
## JobLevel4         11358.57533 12357.84882
## JobLevel5         15231.36546 16369.72723
## TotalWorkingYears    14.94155    51.94211
## [1] 1203.668

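##3rd Model adding Age as well

The third model uses the factors with the strongest correlations: JobLevel, Age, and TotalWorkingYears. Its RMSE of 1203.884 is essentially unchanged from the second model (1203.668), and Age is not significant (p = 0.47), so the best RMSE remains around 1200.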

## 
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + Age + TotalWorkingYears, 
##     data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4947.4  -652.9  -136.8   615.3  4542.1 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2417.795    197.719  12.228   <2e-16 ***
## JobLevel2          2653.840    107.720  24.636   <2e-16 ***
## JobLevel3          6825.867    152.965  44.624   <2e-16 ***
## JobLevel4         11869.515    255.118  46.525   <2e-16 ***
## JobLevel5         15811.576    290.482  54.432   <2e-16 ***
## Age                   4.548      6.316   0.720   0.4716    
## TotalWorkingYears    29.545     10.871   2.718   0.0067 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1256 on 863 degrees of freedom
## Multiple R-squared:  0.9259, Adjusted R-squared:  0.9254 
## F-statistic:  1797 on 6 and 863 DF,  p-value: < 2.2e-16
##                          2.5 %      97.5 %
## (Intercept)        2029.728420  2805.86250
## JobLevel2          2442.416087  2865.26407
## JobLevel3          6525.639779  7126.09387
## JobLevel4         11368.789547 12370.24015
## JobLevel5         15241.442578 16381.70959
## Age                  -7.847635    16.94390
## TotalWorkingYears     8.209551    50.88143
## [1] 1203.884

##Imputing the values for Test Set